With the rise of AI in recent years and the increasing complexity of models, the growing demand for computational resources is starting to pose a significant challenge. The need for higher compute power is being met with increasingly potent accelerators and the use of large compute clusters. However, the gain in prediction accuracy from large models trained on distributed and accelerated systems comes at the price of a substantial increase in energy demand, and researchers have started questioning the environmental friendliness of such AI methods at scale. Consequently, energy efficiency plays an important role for AI model developers and infrastructure operators alike. The energy consumption of AI workloads depends on the model implementation and the utilized hardware. Therefore, accurate measurements of the power draw of AI workflows on different types of compute nodes are key to algorithmic improvements and the design of future compute clusters and hardware. To this end, we present measurements of the energy consumption of two typical applications of deep learning models on different types of compute nodes. Our results indicate that (1) deriving energy consumption directly from runtime is not accurate; instead, the composition of the compute node needs to be taken into account; (2) neglecting accelerator hardware on mixed nodes results in disproportionate energy inefficiency; (3) energy consumption of model training and inference should be considered separately: while training on GPUs outperforms all other node types regarding both runtime and energy consumption, inference on CPU nodes can be comparably efficient. One advantage of our approach is that the information on energy consumption is available to all users of the supercomputer, enabling an easy transfer to other workloads alongside a rise in user awareness of energy consumption.
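To illustrate why runtime alone is a poor proxy for energy, the following minimal sketch integrates sampled power draw over time with the trapezoidal rule. The sample values, node labels, and function names are hypothetical and are not taken from the paper's measurements; they only show that two jobs with the same runtime can consume very different amounts of energy.

```python
def energy_joules(timestamps_s, power_w):
    """Integrate sampled power draw (watts) over time (seconds), trapezoidal rule."""
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have equal length")
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


def joules_to_kwh(joules):
    """Convert joules to kilowatt-hours (1 kWh = 3.6e6 J)."""
    return joules / 3.6e6


# Hypothetical power samples: identical 30 s runtime, different node composition.
t = [0.0, 10.0, 20.0, 30.0]
cpu_node_w = [200.0, 220.0, 220.0, 200.0]
gpu_node_w = [400.0, 900.0, 900.0, 400.0]

if __name__ == "__main__":
    for name, samples in (("cpu node", cpu_node_w), ("gpu node", gpu_node_w)):
        e = energy_joules(t, samples)
        print(f"{name}: {e:.0f} J = {joules_to_kwh(e):.6f} kWh")
```

In practice the samples would come from whatever power telemetry the cluster exposes; the integration step stays the same.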
Transformers have recently gained attention in the computer vision domain due to their ability to model long-range dependencies. However, the self-attention mechanism, which is the core part of the Transformer model, usually suffers from quadratic computational complexity with respect to the number of tokens. Many architectures attempt to reduce model complexity by limiting the self-attention mechanism to local regions or by redesigning the tokenization process. In this paper, we propose DAE-Former, a novel method that seeks to provide an alternative perspective by efficiently designing the self-attention mechanism. More specifically, we reformulate the self-attention mechanism to capture both spatial and channel relations across the whole feature dimension while staying computationally efficient. Furthermore, we redesign the skip connection path by including the cross-attention module to ensure the feature reusability and enhance the localization power. Our method outperforms state-of-the-art methods on multi-organ cardiac and skin lesion segmentation datasets without requiring pre-training weights. The code is publicly available at https://github.com/mindflow-institue/DAEFormer.
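The quadratic cost mentioned above comes from the n x n score matrix of token-wise self-attention. The sketch below contrasts it with one common channel-wise formulation whose score matrix is d x d and therefore does not grow with the number of tokens. This is a generic illustration in plain Python, not the DAE-Former implementation; the function names, the scaling, and the softmax axis are assumptions.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(r) for r in zip(*m)]


def softmax_rows(m):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        s = sum(exps)
        out.append([e / s for e in exps])
    return out


def token_attention(q, k, v):
    """Standard attention: scores are n x n, quadratic in the number of tokens."""
    d = len(q[0])
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, transpose(k))]
    shape = (len(scores), len(scores[0]))
    return matmul(softmax_rows(scores), v), shape


def channel_attention(q, k, v):
    """Channel-wise attention: scores are d x d, independent of the token count."""
    scores = matmul(transpose(k), q)  # K^T Q
    shape = (len(scores), len(scores[0]))
    return matmul(v, softmax_rows(scores)), shape


if __name__ == "__main__":
    n_tokens, dim = 6, 2  # toy sizes, assumed for illustration
    x = [[0.1 * ((i + j) % 3) for j in range(dim)] for i in range(n_tokens)]
    _, token_map = token_attention(x, x, x)
    _, channel_map = channel_attention(x, x, x)
    print("token score matrix:", token_map, "channel score matrix:", channel_map)
```

Doubling the number of tokens quadruples the size of the token score matrix, while the channel score matrix keeps its size.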
The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as the bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical imaging analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% of the challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants stated that they did not have enough time for method development (32%). 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants, and only 50% of the participants performed ensembling, based on multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
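Patch-based training, the most commonly reported workaround for oversized samples, can be sketched as follows. This is a generic illustration and not a method described by the survey; the function names, the nested-list image representation, and the choice to shift the final patch so that it ends at the border are assumptions.

```python
def patch_origins(length, patch, stride):
    """Start offsets so that patches of size `patch` cover the range [0, length)."""
    if patch >= length:
        return [0]  # a single (possibly smaller) patch covers the whole axis
    starts = list(range(0, length - patch + 1, stride))
    if starts[-1] != length - patch:
        starts.append(length - patch)  # shift the last patch to end at the border
    return starts


def extract_patches(image, patch_h, patch_w, stride_h, stride_w):
    """Return a list of ((row, col), patch) pairs for a 2D image (list of rows)."""
    height, width = len(image), len(image[0])
    patches = []
    for y in patch_origins(height, patch_h, stride_h):
        for x in patch_origins(width, patch_w, stride_w):
            patch = [row[x:x + patch_w] for row in image[y:y + patch_h]]
            patches.append(((y, x), patch))
    return patches


if __name__ == "__main__":
    # Toy 5 x 7 "image" whose pixel value encodes its position.
    image = [[r * 10 + c for c in range(7)] for r in range(5)]
    for origin, patch in extract_patches(image, 4, 4, 3, 3):
        print(origin, len(patch), "x", len(patch[0]))
```

A stride smaller than the patch size yields overlapping patches, which is often combined with averaging of predictions in the overlap at inference time.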
In this paper, we introduce a novel network that generates semantic, instance, and part segmentation using a shared encoder and effectively fuses them to achieve panoptic-part segmentation. Unifying these three segmentation problems allows for mutually improved and consistent representation learning. To fuse the predictions of all three heads efficiently, we introduce a parameter-free joint fusion module that dynamically balances the logits and fuses them to create panoptic-part segmentation. Our method is evaluated on the Cityscapes Panoptic Parts (CPP) and Pascal Panoptic Parts (PPP) datasets. For CPP, the PartPQ of our proposed model with joint fusion surpasses the previous state-of-the-art by 1.6 and 4.7 percentage points for all areas and segments with parts, respectively. On PPP, our joint fusion outperforms a model using the previous top-down merging strategy by 3.3 percentage points in PartPQ and 10.5 percentage points in PartPQ for partitionable classes.
This white paper lays out a vision of research and development in the field of artificial intelligence for the next decade (and beyond). Its denouement is a cyber-physical ecosystem of natural and synthetic sense-making, in which humans are integral participants, what we call "shared intelligence". This vision is premised on active inference, a formulation of adaptive behavior that can be read as a physics of intelligence, and which inherits from the physics of self-organization. In this context, we understand intelligence as the capacity to accumulate evidence for a generative model of one's sensed world, also known as self-evidencing. Formally, this corresponds to maximizing (Bayesian) model evidence, via belief updating over several scales: i.e., inference, learning, and model selection. Operationally, this self-evidencing can be realized via (variational) message passing or belief propagation on a factor graph. Crucially, active inference foregrounds an existential imperative of intelligent systems; namely, curiosity or the resolution of uncertainty. This same imperative underwrites belief sharing in ensembles of agents, in which certain aspects (i.e., factors) of each agent's generative world model provide a common ground or frame of reference. Active inference plays a foundational role in this ecology of belief sharing, leading to a formal account of collective intelligence that rests on shared narratives and goals. We also consider the kinds of communication protocols that must be developed to enable such an ecosystem of intelligences and motivate the development of a shared hyper-spatial modeling language and transaction protocol, as a first (and key) step towards such an ecology.
State-of-the-art object detectors are fast and accurate, but they require a large amount of well-annotated training data to obtain good performance. However, obtaining a large amount of training annotations specific to a particular task, i.e., fine-grained annotations, is costly in practice. In contrast, obtaining common-sense relationships from text, e.g., "a table-lamp is a lamp that sits on top of a table", is much easier. Additionally, common-sense relationships like "on-top-of" are easy to annotate in a task-agnostic fashion. In this paper, we propose a probabilistic model that uses such relational knowledge to transform an off-the-shelf detector of coarse object categories (e.g., "table", "lamp") into a detector of fine-grained categories (e.g., "table-lamp"). We demonstrate that our method, RelDetect, achieves performance competitive with finetuning-based state-of-the-art object detector baselines when an extremely low amount of fine-grained annotations is available ($0.2\%$ of the entire dataset). We also demonstrate that RelDetect is able to utilize the inherent transferability of relationship information to obtain better performance ($+5$ mAP points) than the above baselines on an unseen dataset (zero-shot transfer). In summary, we demonstrate the power of using relationships for object detection on datasets where fine-grained object categories can be linked to coarse-grained categories via suitable relationships.
Object permanence is the concept that objects do not suddenly disappear in the physical world. Humans understand this concept at a young age and know that another person is still there even when they are temporarily occluded. Neural networks currently often struggle with this challenge. Thus, we introduce explicit object permanence into two-stage detection approaches, drawing inspiration from particle filters. At the core, our detector uses the predictions of previous frames as additional proposals for the current one at inference time. Experiments confirm that the feedback loop improves detection performance by up to 10.3 mAP with little computational overhead. Our approach is suited to extend two-stage detectors for stabilized and reliable detections even under heavy occlusion. Additionally, the ability to apply our method without retraining an existing model promises wide application in real-world tasks.
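The feedback loop described above, in which previous-frame predictions become additional proposals, might look roughly like the sketch below. The abstract only states that earlier predictions are fed back as proposals; the score threshold, the IoU-based duplicate check, and all function names here are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def augment_proposals(current_proposals, previous_detections,
                      min_score=0.5, duplicate_iou=0.9):
    """Append confident previous-frame boxes that no existing proposal already covers.

    current_proposals: list of boxes from the first stage for the current frame.
    previous_detections: list of (box, score) pairs from the previous frame.
    """
    merged = list(current_proposals)
    for box, score in previous_detections:
        if score < min_score:
            continue  # ignore weak detections from the previous frame
        if all(iou(box, existing) < duplicate_iou for existing in merged):
            merged.append(box)
    return merged


if __name__ == "__main__":
    proposals = [(0, 0, 10, 10)]
    previous = [((0, 0, 10, 10), 0.9),    # already proposed, skipped
                ((20, 20, 30, 30), 0.8),  # occluded now, fed back as a proposal
                ((40, 40, 50, 50), 0.2)]  # low confidence, dropped
    print(augment_proposals(proposals, previous))
```

The merged list would then be passed to the second stage of the detector in place of the first-stage proposals alone, which requires no retraining.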
Tourette Syndrome (TS) is a behavior disorder that begins in childhood and is characterized by the expression of involuntary movements and sounds commonly referred to as tics. Behavioral therapy is the first-line treatment for patients with TS, and it helps patients raise awareness about tic occurrence as well as develop tic inhibition strategies. However, the limited availability of therapists and the difficulties of in-home follow-up work limit its effectiveness. An automatic tic detection system that is easy to deploy could alleviate the difficulties of home therapy by providing feedback to the patients while exercising tic awareness. In this work, we propose a novel architecture (T-Net) for automatic tic detection and classification from untrimmed videos. T-Net combines temporal detection and segmentation and operates on features that are interpretable to a clinician. We compare T-Net to several state-of-the-art systems working on deep features extracted from the raw videos, and T-Net achieves comparable performance in terms of average precision while relying on the interpretable features needed in clinical practice.
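From both an educational and a research point of view, experiments on hardware are a key aspect of robotics and control. In the last decade, many open-source hardware and software frameworks for wheeled robots have been presented, mainly in the form of unicycles and car-like robots, with the goal of making robotics accessible to a wider audience and supporting control systems development. Unicycles are usually small and inexpensive, and therefore facilitate experiments in larger fleets, but they are not suited for high-speed motion. Car-like robots are more agile, but they are usually larger and more expensive, thus requiring more resources in terms of space and money. To bridge this gap, we present Chronos, a new car-like 1/28th scale robot with customized open-source electronics, and CRS, an open-source software framework for control and robotics. The CRS software framework includes implementations of various state-of-the-art algorithms for control, estimation, and multi-agent coordination. With this work, we aim to provide easier access to hardware and to reduce the engineering time needed to start new educational and research projects.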
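Hyperparameter optimization (HPO) is a well-studied research field. However, the effects and interactions of the components in an HPO pipeline are not yet well investigated. We therefore ask: can the landscape of HPO be biased by the pipeline used to evaluate individual configurations? To address this question, we propose to analyze the effect of the HPO pipeline on HPO problems using fitness landscape analysis. In particular, we study the DS-2019 HPO benchmark dataset, looking for patterns that could indicate a malfunction of the evaluation pipeline, and relate them to HPO performance. Our main findings are: (i) in most instances, large groups of diverse hyperparameters (i.e., multiple configurations) yield the same poor performance, most likely associated with majority-class prediction models; (ii) in these cases, a worsened correlation between the observed fitness and the average fitness in the neighborhood is observed, potentially making the deployment of local-search-based HPO strategies harder. Finally, we conclude that the definition of the HPO pipeline might negatively affect the HPO landscape.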
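The correlation between a configuration's fitness and the average fitness of its neighborhood, referred to in finding (ii), can be sketched as below. This is a generic fitness landscape analysis measure written under assumptions: the one-dimensional toy landscapes, the neighborhood definition, and the function names are hypothetical and do not reproduce the paper's procedure or the DS-2019 data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; NaN if either input has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)


def neighbor_fitness_correlation(fitness, neighbors):
    """Correlate each configuration's fitness with its neighbors' mean fitness.

    fitness: dict mapping a configuration to its observed fitness.
    neighbors: function returning the candidate neighbors of a configuration.
    """
    own, neighborhood = [], []
    for config, value in fitness.items():
        values = [fitness[n] for n in neighbors(config) if n in fitness]
        if values:
            own.append(value)
            neighborhood.append(sum(values) / len(values))
    return pearson(own, neighborhood)


def grid_neighbors(i):
    """Neighbors on a 1D hyperparameter grid (assumed toy neighborhood)."""
    return [i - 1, i + 1]


if __name__ == "__main__":
    smooth = {i: float(i) for i in range(10)}      # fitness changes gradually
    rugged = {i: float(i % 2) for i in range(10)}  # fitness flips between neighbors
    print("smooth:", neighbor_fitness_correlation(smooth, grid_neighbors))
    print("rugged:", neighbor_fitness_correlation(rugged, grid_neighbors))
```

A high correlation suggests that neighbors are informative about a configuration's quality, which is what local search relies on; a low or negative value suggests the opposite.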